Overview

Column

Research Questions/Purpose

• How much do offensive aspects in baseball affect a player’s salary?

• How much does Age affect a players salary?

• Which team generates the most RBIs?

• Which position in baseball makes the most money?

In this analysis I aimed to explain an MLB player’s salary based on their offensive statistics. This data set contains 478 observations of players and their offensive statistics and their respective salary from the year 2022. After using AIC variable selection and graphical analysis, this study found that RBI and Age contribute most to a players salary. Using OLS regression, I found that RBI and Age are statistically significant on all levels and a reported \(R^2 = 0.34342\).

Results

Column

Findings/Statistical Significance

My findings are: :D

Regression Summary

Column

Boxplot of Salary vs Player’s Age Group

Boxplot of Salary vs. RBIs

Variable Selection

Column

AIC Variable Selection

Start:  AIC=13238.3
Salary ~ 1

       Df  Sum of Sq        RSS   AIC
+ HR    1 4.2867e+15 1.9343e+16 13157
+ RBI   1 4.1456e+15 1.9484e+16 13160
+ BB    1 3.8543e+15 1.9775e+16 13166
+ Age   1 3.6114e+15 2.0018e+16 13171
+ SO    1 2.6136e+15 2.1016e+16 13191
+ OPS   1 1.8410e+15 2.1788e+16 13206
+ BA    1 1.3293e+15 2.2300e+16 13216
+ SB    1 4.7420e+14 2.3155e+16 13232
<none>               2.3629e+16 13238

Step:  AIC=13156.63
Salary ~ HR

       Df  Sum of Sq        RSS   AIC
+ Age   1 4.1526e+15 1.5190e+16 13058
+ BB    1 2.8801e+14 1.9055e+16 13152
+ RBI   1 1.3198e+14 1.9211e+16 13156
<none>               1.9343e+16 13157
+ SO    1 8.4287e+12 1.9334e+16 13158
+ BA    1 4.8628e+12 1.9338e+16 13158
+ OPS   1 1.6593e+11 1.9343e+16 13159
+ SB    1 8.9588e+10 1.9343e+16 13159

Step:  AIC=13057.61
Salary ~ HR + Age

       Df  Sum of Sq        RSS   AIC
+ BB    1 2.7845e+14 1.4912e+16 13052
+ RBI   1 1.9537e+14 1.4995e+16 13054
+ SB    1 1.4568e+14 1.5044e+16 13056
<none>               1.5190e+16 13058
+ BA    1 4.0571e+13 1.5150e+16 13058
+ OPS   1 1.4568e+13 1.5176e+16 13059
+ SO    1 7.9821e+11 1.5189e+16 13060

Step:  AIC=13051.87
Salary ~ HR + Age + BB

       Df  Sum of Sq        RSS   AIC
<none>               1.4912e+16 13052
+ SB    1 6.9871e+13 1.4842e+16 13052
+ SO    1 6.3024e+13 1.4849e+16 13052
+ RBI   1 4.3889e+13 1.4868e+16 13053
+ OPS   1 7.7980e+12 1.4904e+16 13054
+ BA    1 3.5841e+10 1.4912e+16 13054

Call:
lm(formula = Salary ~ HR + Age + BB, data = mlb_data)

Residuals:
      Min        1Q    Median        3Q       Max 
-18239809  -3504679   -825701   1667401  29570560 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -24199640    2466349  -9.812  < 2e-16 ***
HR             236228      47918   4.930 1.19e-06 ***
Age            895247      83473  10.725  < 2e-16 ***
BB              59629      21446   2.780  0.00568 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6002000 on 414 degrees of freedom
Multiple R-squared:  0.3689,    Adjusted R-squared:  0.3644 
F-statistic: 80.68 on 3 and 414 DF,  p-value: < 2.2e-16

Column

Scatter plot Matrix of Salary vs RBI + Age

Scatter plot Matrix of log(Salary) vs RBI + Age

Position Summary

Column

Position and Avg. Salary

Avg. Salary vs Position

Avg. RBI vs Position

Column

Analysis

:D :) :] :3

Team Summary

Column

Bar Chart of Avg Salary vs. Team

Bar Chart of Avg. RBIs vs. Team

Map

Column

Analysis

---
title: "EDA for MLB Salary"
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: default
      navbar-bg: "#FFD580"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

<style>
.chart-title 
  {  
    /* chart_title  */
    font-size: 18px;
  }
body
  {
      /* Normal  */
      font-size: 16px;
  }
</style>

```{css}
/* Set font color of inactive tab to green */
.nav-tabs-custom .nav-tabs > li > a 
  {
    color: green;
  } 

/* Set font color of active tab to red */
.nav-tabs-custom .nav-tabs > li.active > a 
  {
    color: red;
  } 

/* To set color on hover */
.nav-tabs-custom .nav-tabs > li.active > a:hover 
  {
    color: purple;
  }
```

```{r setup, include=FALSE}
library(flexdashboard)
```

Overview
===

Column {data-width=650}
-----------------------------------------------------------------------

### Research Questions/Purpose

• How much do offensive aspects in baseball affect a player's salary?

• How much does Age affect a players salary?

• Which team generates the most RBIs?

• Which position in baseball makes the most money?

In this analysis I aimed to explain an MLB player's salary based on their offensive statistics. This data set contains 478 observations of players and their offensive statistics and their respective salary from the year 2022. After using AIC variable selection and graphical analysis, this study found that RBI and Age contribute most to a players salary. Using OLS regression, I found that RBI and Age are statistically significant on all levels and a reported $R^2 = 0.34342$.

```{r}

```

```{r, data/packages}
library(pacman)
pacman::p_load(tidyverse, rvest, stringr, 
               ggplot2, forcats, writexl, 
               maps, viridis, scales, plotly)

mlb_salary <- read_csv("mlb_salary.csv")
#mlb_batting <- read_delim("C:\\Users\\altos\\Documents\\Datasets\\mlbb.txt", delim = ";")
mlb_batting <- read_csv("mlb_batting.csv")
#mlb_map <- read_csv("map_mlb.csv")

#colnames(mlb_map)[1] <- "Tm"

# Getting rid of observations where the playerwas traded, taking most recent trade
mlb_batting <- mlb_batting[!duplicated(mlb_batting$Name),]
mlb_batting <- mlb_batting[!mlb_batting$Tm == "TOT",]

# Cleaning name column
mlb_batting$Name <- gsub("\xa0", " ", mlb_batting$Name)
mlb_batting$Name <- gsub("<e9>", "e", mlb_batting$Name)
mlb_batting$Name <- gsub("#", "", mlb_batting$Name)
mlb_batting$Name <- gsub("<c1>", "A", mlb_batting$Name)
mlb_batting$Name <- gsub("<f3>", "o", mlb_batting$Name)
mlb_batting$Name <- gsub("<e1>", "a", mlb_batting$Name)
mlb_batting$Name <- gsub("<f1>", "n", mlb_batting$Name)
mlb_batting$Name <- gsub("<ed>", "i", mlb_batting$Name)
mlb_batting$Name <- stringr::str_replace(mlb_batting$Name, '\\*', '')

#mlb_salary$Player <- sub("(\\w+),\\s(\\w+)","\\2 \\1", mlb_salary$Player)
#colnames(mlb_salary)[1] <- "Name"
#mlb_salary <- subset(mlb_salary, select = -c(Average_Annual, Total_Value, Years))
#mlb_data <- merge(x = mlb_batting, y = mlb_salary, by = "Name")

mlb_batting <- mlb_batting %>%
  arrange(Name)

allnames <- unlist(str_split(mlb_salary$Player, ", "))
mlb_salary$First <- allnames[1:1942%%2==0]
mlb_salary$Last <- allnames[1:1942%%2==1]
mlb_salary$Name <- paste(mlb_salary$First, mlb_salary$Last)
mlb_salary$Name <- str_remove_all(mlb_salary$Name, "\\.")
mlb_batting$Name <- str_remove_all(mlb_batting$Name, "\\.")

mlb_salary <- mlb_salary %>%
  arrange(Name)

mlb_salary$Name[which(mlb_salary$Name=="A Minter.J.")] <- "Alex Minter"
mlb_salary$Name[which(mlb_salary$Name=="A Puk.J.")] <- "Andrew Puk"
mlb_salary$Name[which(mlb_salary$Name=="D'Travis Arnaud")] <- "Travis d'Arnaud"
mlb_salary$Name[which(mlb_salary$Name=="Abraham Toro-Hernandez")] <- "Abraham Toro"
mlb_salary$Name[which(mlb_salary$Name=="AJ Pollock IV")] <- "AJ Pollock"
mlb_salary$Name[which(mlb_salary$Name=="Alexander Colome")] <- "Alex Colome"
mlb_salary$Name[which(mlb_salary$Name=="Mullins Cedric II")] <- "Cedric Mullins"
mlb_salary$Name[which(mlb_salary$Name=="Christopher Martin")] <- "Chris Martin"
mlb_salary$Name[which(mlb_salary$Name=="Jazz Chisholm")] <- "Jazz Chisholm Jr"
mlb_salary$Name[which(mlb_salary$Name=="LaMonte Wade")] <- "LaMonte Wade Jr"
mlb_salary$Name[which(mlb_salary$Name=="Lourdes Gurriel")] <- "Lourdes Gurriel Jr"
mlb_salary$Name[which(mlb_salary$Name=="Vladimir Guerrero")] <- "Vladimir Guerrero Jr"
mlb_batting$Name[which(mlb_batting$Name=="Michael A Taylor")] <- "Michael Taylor"

mlb_salary <- mlb_salary %>%
  select(Name, Position, Salary)

mlb_data <- mlb_batting %>%
  left_join(mlb_salary, key = c(Name))

mlb_data <- mlb_data %>%
  filter(!is.na(Salary))

mlb_data[,"Age_Gp"] <- NA

mlb_data <- mlb_data %>%
  mutate(Age_Gp = case_when(
  Age >= 20 & Age < 25 ~ "20-25",
  Age >= 25 & Age < 30 ~ "25-30", Age >= 30 & Age < 35 ~ "30-35",
  Age >= 35 & Age < 40 ~ "35-40", Age >= 40 & Age < 45 ~ "40-45",
  Age >= 45 & Age < 50 ~ "45-50"))

mlb_data$Age_Gp <- as.factor(mlb_data$Age_Gp)

mlb_data[,"RBI_Gp"] <- NA

mlb_data <- mlb_data %>%
  mutate(RBI_Gp = case_when(
  RBI >= 0 & RBI < 30 ~ "0-30",
  RBI >= 30 & RBI < 60 ~ "30-60", RBI >= 60 & RBI < 90 ~ "60-90",
  RBI >= 90 & RBI < 120 ~ "90-120", RBI >= 120 ~ "120+"))

mlb_data$RBI_Gp <- as.factor(mlb_data$RBI_Gp)

avg_p <- mlb_data %>%
  group_by(Position) %>%
  summarise(Avg_RBI = mean(RBI), Avg_Sal = mean(Salary))

avg_tm <- mlb_data %>%
  group_by(Tm) %>%
  summarise(Avg_Sal = mean(Salary), Avg_RBI = mean(RBI))

avg_p$Position <- as.factor(avg_p$Position)
avg_tm$Tm <- as.factor(avg_tm$Tm)

#avg_tm <- avg_tm %>%
  #left_join(mlb_map, key = c(Tm))

avg_p <- avg_p[-c(5,10),]
```

Results
===

Column {data-width=650}
-----------------------------------------------------------------------

### Findings/Statistical Significance

My findings are: :D

### Regression Summary

```{r results}
# Efficient Model
# fit.mlb <- lm(Salary ~ Age + RBI, data = mlb_data)
# summary(fit.mlb)

# AIC Model
fit.mlb2 <- lm(Salary ~ BB + Age + HR, data = mlb_data)
S2 <- summary(fit.mlb2)
```

```{r fig.align='center', out.width="80%"}
knitr::include_graphics("Linear.jpg")
```

Column {.tabset data-width=650}
-----------------------------------------------------------------------

### Boxplot of Salary vs Player's Age Group

```{r}
ggplot(mlb_data, aes(x = Age_Gp, y = Salary)) + 
  geom_boxplot(color = "#81d4fa", fill = "#03a9f4") +
  labs(x = "Age Group") +
  scale_y_continuous(labels = label_comma()) +
  theme_classic()
```

### Boxplot of Salary vs. RBIs

```{r}
mlb_data %>%
  mutate(class = fct_reorder(RBI_Gp, Salary, na.rm = T, .fun = 'median')) %>%
ggplot(aes(x = reorder(RBI_Gp, Salary), y = Salary)) + 
  geom_boxplot(fill = "#ffd9b3", color = "#ffa64d") +
  labs(x = "Number of RBIs", y = "Salary") +
  scale_y_continuous(labels = label_comma()) + 
  theme_classic()
```

Variable Selection
===

Column {data-width=650}
-----------------------------------------------------------------------

### AIC Variable Selection 

```{r}
fit.null <- lm(Salary ~ 1, data = mlb_data)
fit.AIC <- step(fit.null, scope = Salary ~ Age + HR + RBI + SB + BB + SO + BA + OPS, direction = "forward", k = 2)
fit.a <- lm(Salary ~ HR + Age + BB, data = mlb_data)
summary(fit.a)
```

Column {data-width=650}
-----------------------------------------------------------------------

### Scatter plot Matrix of Salary vs RBI + Age

```{r}
pairs(~log(Salary) + Age + HR + BB, data = mlb_data)
```

### Scatter plot Matrix of log(Salary) vs RBI + Age

```{r}
pairs(~log(Salary) + Age + RBI, data = mlb_data)
```

Position Summary
===

Column {.tabset data-width=650}
-----------------------------------------------------------------------

### Position and Avg. Salary

```{r graph (stadium)}
devtools::install_github("bdilday/GeomMLBStadiums")
library(GeomMLBStadiums)

mlb_position <- as.data.frame(rep(NA, 7))
colnames(mlb_position)[1] <- "Position"
mlb_position[,"Latitude"] <- NA 
mlb_position[,"Longitude"] <- NA
mlb_position[,"Avg_Salary"] <- NA
mlb_position$Avg_Salary <- c(7575904, 4294942, 9047747, 
                             6954996, 3414633, 3330355, 
                             5636252)

mlb_position$Position <- c("1st Base", "2nd Base", "3rd Base",
                           "Shortstop (SS)", "Pitcher", "Catcher",
                           "Outfield (OF)")

mlb_position$Latitude <- c(72,40,-72,-40,1,1,1)
mlb_position$Longitude <- c(85,115,85,115,70,-20,270)
                      
p <- ggplot(mlb_position, aes(x = Latitude, y = Longitude, text = paste0(Position, ":\n", "Mean Salary: ", Avg_Salary))) +
  geom_spraychart(stadium_transform_coords = TRUE, stadium_segments = "all", stadium_ids = "reds") +
  coord_fixed() +
  theme_void() + 
  theme(axis.title.x=element_blank(), axis.text.x=element_blank(),
        axis.ticks.x=element_blank(), axis.title.y=element_blank(),
        axis.text.y =element_blank(), axis.ticks.y=element_blank(), 
        panel.grid.major = element_blank(), 
        panel.background = element_blank()) +
  geom_point(aes(size = Avg_Salary), color = "#FFD580")

font <- list(family = "Mono", size = 15, color = "black")
label <- list(bgcolor = "#FFD580", font = font)

ggplotly(p, tooltip = "text") %>%
  style(hoverlabel = label) %>%
  layout(font = font)
```

### Avg. Salary vs Position

```{r bar}
p2 <- ggplot(avg_p, aes(x = Position, y = Avg_Sal), text = ) + 
  geom_bar(stat = "identity", color = "orange", fill = "#FFA07A") +
  scale_y_continuous(labels = label_comma()) +
  labs(y = "Avg. Salary") +
  theme_classic()

ggplotly(p2, tooltip = "text")
```

### Avg. RBI vs Position

```{r}
ggplot(avg_p, aes(x = Position, y = Avg_RBI)) +
  geom_histogram(stat = "identity", color = "#FFA07A", fill = "#DE3163") +
  labs(y = "Avg. RBI") +
  theme_classic() 
```

Column {data-width=550}
-----------------------------------------------------------------------

### Analysis

:D :) :] :3

Team Summary
===

Column {.tabset data-length=650}
-----------------------------------------------------------------------

### Bar Chart of Avg Salary vs. Team

```{r map, fig.align='center', out.width="80%"}
options(scipen = 10)
avg_tm$label <- paste0(avg_tm$Tm, ":\n",
                       "Average Salary: $", round(avg_tm$Avg_Sal))
p_salary <- ggplot(avg_tm, aes(y = reorder(Tm, Avg_Sal), x = Avg_Sal, text = label)) +
  geom_bar(stat = "identity", fill = "#F1959B") +
  labs(y = "Team", x = "Avg. Salary") +
  theme_classic() +
  scale_x_continuous(label= comma, limits=c(0, 13000000))
ggplotly(p_salary, tooltip = "text")
```

### Bar Chart of Avg. RBIs vs. Team

```{r map2, fig.align='center', out.width="80%"}
ggplot(avg_tm, aes(y = reorder(Tm, Avg_RBI), x = Avg_RBI)) +
  geom_histogram(stat = "identity", fill = "#ff8164") +
  labs(y = "Team", x = "Avg. RBI") +
  theme_classic()
```

### Map 

```{r}

```

Column {data-length=650}
-----------------------------------------------------------------------

### Analysis

```{r}
library(shiny)
tags$iframe(style="height:1000px; width:100%; scrolling=yes",
src="MTH 207 comments from Becky.pdf")
```